{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Audio Feature Extractions\n\n``torchaudio`` implements feature extractions commonly used in the audio\ndomain. They are available in ``torchaudio.functional`` and\n``torchaudio.transforms``.\n\n``functional`` implements features as standalone functions.\nThey are stateless.\n\n``transforms`` implements features as objects,\nusing implementations from ``functional`` and ``torch.nn.Module``.\nThey can be serialized using TorchScript.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torchaudio\nimport torchaudio.functional as F\nimport torchaudio.transforms as T\n\nprint(torch.__version__)\nprint(torchaudio.__version__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Preparation\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>When running this tutorial in Google Colab, install the required packages\n\n   .. code::\n\n      !pip install librosa</p></div>\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from IPython.display import Audio\nimport librosa\nimport matplotlib.pyplot as plt\nfrom torchaudio.utils import download_asset\n\ntorch.random.manual_seed(0)\n\nSAMPLE_SPEECH = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\")\n\n\ndef plot_waveform(waveform, sr, title=\"Waveform\"):\n    waveform = waveform.numpy()\n\n    num_channels, num_frames = waveform.shape\n    time_axis = torch.arange(0, num_frames) / sr\n\n    figure, axes = plt.subplots(num_channels, 1)\n    axes.plot(time_axis, waveform[0], linewidth=1)\n    axes.grid(True)\n    figure.suptitle(title)\n    plt.show(block=False)\n\n\ndef plot_spectrogram(specgram, title=None, ylabel=\"freq_bin\"):\n    fig, axs = plt.subplots(1, 1)\n    axs.set_title(title or \"Spectrogram (db)\")\n    axs.set_ylabel(ylabel)\n    axs.set_xlabel(\"frame\")\n    im = axs.imshow(librosa.power_to_db(specgram), origin=\"lower\", aspect=\"auto\")\n    fig.colorbar(im, ax=axs)\n    plt.show(block=False)\n\n\ndef plot_fbank(fbank, title=None):\n    fig, axs = plt.subplots(1, 1)\n    axs.set_title(title or \"Filter bank\")\n    axs.imshow(fbank, aspect=\"auto\")\n    axs.set_ylabel(\"frequency bin\")\n    axs.set_xlabel(\"mel bin\")\n    plt.show(block=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Overview of audio features\n\nThe following diagram shows the relationship between common audio features\nand torchaudio APIs to generate them.\n\n<img src=\"https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png\">\n\nFor the complete list of available features, please refer to the\ndocumentation.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Spectrogram\n\nTo get the frequency make-up of an audio signal as it varies with time,\nyou can use :py:func:`torchaudio.transforms.Spectrogram`.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH)\n\nplot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title=\"Original waveform\")\nAudio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_fft = 1024\nwin_length = None\nhop_length = 512\n\n# Define transform\nspectrogram = T.Spectrogram(\n    n_fft=n_fft,\n    win_length=win_length,\n    hop_length=hop_length,\n    center=True,\n    pad_mode=\"reflect\",\n    power=2.0,\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Perform transform\nspec = spectrogram(SPEECH_WAVEFORM)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_spectrogram(spec[0], title=\"torchaudio\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## GriffinLim\n\nTo recover a waveform from a spectrogram, you can use ``GriffinLim``.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "torch.random.manual_seed(0)\n\nn_fft = 1024\nwin_length = None\nhop_length = 512\n\nspec = T.Spectrogram(\n    n_fft=n_fft,\n    win_length=win_length,\n    hop_length=hop_length,\n)(SPEECH_WAVEFORM)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "griffin_lim = T.GriffinLim(\n    n_fft=n_fft,\n    win_length=win_length,\n    hop_length=hop_length,\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "reconstructed_waveform = griffin_lim(spec)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(reconstructed_waveform, SAMPLE_RATE, title=\"Reconstructed\")\nAudio(reconstructed_waveform, rate=SAMPLE_RATE)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Mel Filter Bank\n\n:py:func:`torchaudio.functional.melscale_fbanks` generates the filter bank\nfor converting frequency bins to mel-scale bins.\n\nSince this function does not require input audio/features, there is no\nequivalent transform in :py:func:`torchaudio.transforms`.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_fft = 256\nn_mels = 64\nsample_rate = 6000\n\nmel_filters = F.melscale_fbanks(\n    int(n_fft // 2 + 1),\n    n_mels=n_mels,\n    f_min=0.0,\n    f_max=sample_rate / 2.0,\n    sample_rate=sample_rate,\n    norm=\"slaney\",\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_fbank(mel_filters, \"Mel Filter Bank - torchaudio\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Comparison against librosa\n\nFor reference, here is the equivalent way to get the mel filter bank\nwith ``librosa``.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "mel_filters_librosa = librosa.filters.mel(\n    sr=sample_rate,\n    n_fft=n_fft,\n    n_mels=n_mels,\n    fmin=0.0,\n    fmax=sample_rate / 2.0,\n    norm=\"slaney\",\n    htk=True,\n).T"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_fbank(mel_filters_librosa, \"Mel Filter Bank - librosa\")\n\nmse = torch.square(mel_filters - mel_filters_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## MelSpectrogram\n\nGenerating a mel-scale spectrogram involves generating a spectrogram\nand performing mel-scale conversion. In ``torchaudio``,\n:py:func:`torchaudio.transforms.MelSpectrogram` provides\nthis functionality.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_fft = 1024\nwin_length = None\nhop_length = 512\nn_mels = 128\n\nmel_spectrogram = T.MelSpectrogram(\n    sample_rate=sample_rate,\n    n_fft=n_fft,\n    win_length=win_length,\n    hop_length=hop_length,\n    center=True,\n    pad_mode=\"reflect\",\n    power=2.0,\n    norm=\"slaney\",\n    onesided=True,\n    n_mels=n_mels,\n    mel_scale=\"htk\",\n)\n\nmelspec = mel_spectrogram(SPEECH_WAVEFORM)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_spectrogram(melspec[0], title=\"MelSpectrogram - torchaudio\", ylabel=\"mel freq\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Comparison against librosa\n\nFor reference, here is the equivalent means of generating mel-scale\nspectrograms with ``librosa``.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "melspec_librosa = librosa.feature.melspectrogram(\n    y=SPEECH_WAVEFORM.numpy()[0],\n    sr=sample_rate,\n    n_fft=n_fft,\n    hop_length=hop_length,\n    win_length=win_length,\n    center=True,\n    pad_mode=\"reflect\",\n    power=2.0,\n    n_mels=n_mels,\n    norm=\"slaney\",\n    htk=True,\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_spectrogram(melspec_librosa, title=\"MelSpectrogram - librosa\", ylabel=\"mel freq\")\n\nmse = torch.square(melspec - melspec_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## MFCC\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_fft = 2048\nwin_length = None\nhop_length = 512\nn_mels = 256\nn_mfcc = 256\n\nmfcc_transform = T.MFCC(\n    sample_rate=sample_rate,\n    n_mfcc=n_mfcc,\n    melkwargs={\n        \"n_fft\": n_fft,\n        \"n_mels\": n_mels,\n        \"hop_length\": hop_length,\n        \"mel_scale\": \"htk\",\n    },\n)\n\nmfcc = mfcc_transform(SPEECH_WAVEFORM)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_spectrogram(mfcc[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Comparison against librosa\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "melspec = librosa.feature.melspectrogram(\n    y=SPEECH_WAVEFORM.numpy()[0],\n    sr=sample_rate,\n    n_fft=n_fft,\n    win_length=win_length,\n    hop_length=hop_length,\n    n_mels=n_mels,\n    htk=True,\n    norm=None,\n)\n\nmfcc_librosa = librosa.feature.mfcc(\n    S=librosa.core.spectrum.power_to_db(melspec),\n    n_mfcc=n_mfcc,\n    dct_type=2,\n    norm=\"ortho\",\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_spectrogram(mfcc_librosa)\n\nmse = torch.square(mfcc - mfcc_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## LFCC\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_fft = 2048\nwin_length = None\nhop_length = 512\nn_lfcc = 256\n\nlfcc_transform = T.LFCC(\n    sample_rate=sample_rate,\n    n_lfcc=n_lfcc,\n    speckwargs={\n        \"n_fft\": n_fft,\n        \"win_length\": win_length,\n        \"hop_length\": hop_length,\n    },\n)\n\nlfcc = lfcc_transform(SPEECH_WAVEFORM)\nplot_spectrogram(lfcc[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pitch\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_pitch(waveform, sr, pitch):\n    figure, axis = plt.subplots(1, 1)\n    axis.set_title(\"Pitch Feature\")\n    axis.grid(True)\n\n    end_time = waveform.shape[1] / sr\n    time_axis = torch.linspace(0, end_time, waveform.shape[1])\n    axis.plot(time_axis, waveform[0], linewidth=1, color=\"gray\", alpha=0.3)\n\n    axis2 = axis.twinx()\n    time_axis = torch.linspace(0, end_time, pitch.shape[1])\n    axis2.plot(time_axis, pitch[0], linewidth=2, label=\"Pitch\", color=\"green\")\n\n    axis2.legend(loc=0)\n    plt.show(block=False)\n\n\nplot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Kaldi Pitch (beta)\n\nKaldi Pitch feature [1] is a pitch detection mechanism tuned for automatic\nspeech recognition (ASR) applications. This is a beta feature in ``torchaudio``,\nand it is available as :py:func:`torchaudio.functional.compute_kaldi_pitch`.\n\n1. A pitch extraction algorithm tuned for automatic speech recognition\n\n   Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal and S.\n   Khudanpur\n\n   2014 IEEE International Conference on Acoustics, Speech and Signal\n   Processing (ICASSP), Florence, 2014, pp.\u00a02494-2498, doi:\n   10.1109/ICASSP.2014.6854049.\n   [[abstract](https://ieeexplore.ieee.org/document/6854049)_],\n   [[paper](https://danielpovey.com/files/2014_icassp_pitch.pdf)_]\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pitch_feature = F.compute_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE)\npitch, nfcc = pitch_feature[..., 0], pitch_feature[..., 1]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_kaldi_pitch(waveform, sr, pitch, nfcc):\n    _, axis = plt.subplots(1, 1)\n    axis.set_title(\"Kaldi Pitch Feature\")\n    axis.grid(True)\n\n    end_time = waveform.shape[1] / sr\n    time_axis = torch.linspace(0, end_time, waveform.shape[1])\n    axis.plot(time_axis, waveform[0], linewidth=1, color=\"gray\", alpha=0.3)\n\n    time_axis = torch.linspace(0, end_time, pitch.shape[1])\n    ln1 = axis.plot(time_axis, pitch[0], linewidth=2, label=\"Pitch\", color=\"green\")\n    axis.set_ylim((-1.3, 1.3))\n\n    axis2 = axis.twinx()\n    time_axis = torch.linspace(0, end_time, nfcc.shape[1])\n    ln2 = axis2.plot(time_axis, nfcc[0], linewidth=2, label=\"NFCC\", color=\"blue\", linestyle=\"--\")\n\n    lns = ln1 + ln2\n    labels = [l.get_label() for l in lns]\n    axis.legend(lns, labels, loc=0)\n    plt.show(block=False)\n\n\nplot_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch, nfcc)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}